Interpretation of User Comments for Detection of Malicious Websites

Author

  • Mehrbod Sharifi
Abstract

Automated understanding of natural language is a challenging problem that has remained open for decades. We have investigated a special case of it: identifying relevant concepts in natural-language text in the context of a specific given task. We have developed a set of general-purpose language interpretation techniques and applied them to the task of detecting malicious websites by analyzing the comments of website visitors. In this context, concepts relate to the behavior or contents of websites, such as the presence of pop-ups and false testimonials. The developed algorithms are based on probabilistic topic models and other dimensionality reduction techniques, applied to a special case of multi-label text classification in which concepts are the output labels. We integrate information about the target task with other relevant information, including relations among concepts and external knowledge sources, using a concept graph. The system iterates between training a topic model on the partially labeled data and optimizing the parameters and the label assignments. We analyze several alternative versions of this mechanism, such as one that measures the quality of separation among topics and eliminates words that are not discriminative. For the task of detecting malicious websites, we have developed an approach that applies machine-learning techniques to automatically collected data about websites and achieves 98% precision and 95% recall. We present a crowdsourcing system for collecting multiple-choice and free-text comments from website visitors, which is especially useful when other sources of information are insufficient or unreliable. We improve detection performance by considering the text features in the comments about a website. This performance gain is greater with unstructured free-text comments than with multiple-choice comments.
Finally, we have evaluated the performance of our language interpretation framework and shown that the performance gain from the extracted concepts is related to the popularity of the website, and that task-based concepts are complementary to text features for obscure websites.
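
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how precision and recall are computed for a binary malicious/benign classifier. The function name and the toy labels are illustrative assumptions; they are not taken from the thesis and do not reproduce its data or results.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for binary labels, where True means 'malicious'.

    Hypothetical helper for illustration only; not the author's evaluation code.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)       # flagged and malicious
    false_pos = sum(1 for p, a in pairs if p and not a)  # flagged but benign
    false_neg = sum(1 for p, a in pairs if not p and a)  # missed malicious site

    # Precision: of the sites flagged as malicious, the fraction that really are.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the truly malicious sites, the fraction that were flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy labels (made up): five websites, three of them truly malicious.
predicted = [True, True, False, True, False]
actual = [True, False, True, True, False]
precision, recall = precision_recall(predicted, actual)
```

With these toy labels, two of the three flagged sites are malicious and two of the three malicious sites are flagged, so both metrics equal 2/3.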

Similar resources

On Credit-based Incentive Mechanisms of Voluntary User Comment Reviewing in Social Networks

With the recent advance of micro-blogs and social networks, people can view and post comments on the websites in a very convenient way. However, it is also a big concern that the malicious users keep polluting the cyber environment by scamming, spamming or repeatedly advertising. So far the most common way to detect and report malicious comments is based on voluntary reviewing from honest users...

Anomaly-based Web Attack Detection: The Application of Deep Neural Network Seq2Seq With Attention Mechanism

Today, the use of the Internet and Internet sites has been an integrated part of the people’s lives, and most activities and important data are in the Internet websites. Thus, attempts to intrude into these websites have grown exponentially. Intrusion detection systems (IDS) of web attacks are an approach to protect users. But, these systems are suffering from such drawbacks as low accuracy in ...

Interactive Website Filter for Safe Web Browsing

Though popularly used for safe web browsing, blacklist-based filters have fundamental limitation in the “window of vulnerability”, the time between malicious website launch and blacklist update. An effective way of seamless protection is to use an add-on filter based on heuristics, but most of prior heuristics have offered the limited scope of protection against new attacks. Moreover, they have...

A New Model for Email Spam Detection using Hybrid of Magnetic Optimization Algorithm with Harmony Search Algorithm

Unfortunately, among internet services, users are faced with several unwanted messages that are not even related to their interests and scope, and they contain advertising or even malicious content. Spam email contains a huge collection of infected and malicious advertising emails that harms data destroying and stealing personal information for malicious purposes. In most cases, spam emails con...

A Trolling Hierarchy in Social Media and A Conditional Random Field For Trolling Detection

An ever-increasing number of social media websites, electronic newspapers and Internet forums allow visitors to leave comments for others to read and interact with. This exchange is not free from participants with malicious intentions, who do not contribute to the written conversation. Among different communities, users adopt strategies to handle such users. In this paper we present a comprehensi...

Publication date: 2012